A ~1000-year 13C Suess Correction Model for the Study of Past Ecosystems (Redux)
code
modeling
stable isotopes
carbon emissions
Author
Jonathan Dombrosky
Published
February 11, 2023
My first published work using the R programming language was in The Holocene. In retrospect, there are some things I would change. So, let’s take the opportunity to look at some changes!
Looking back, the thing that now irks me about this model is that I didn’t show how it handled new data (data unused in model construction). In other words, I didn’t assess overfitting. Luckily, the tidymodels metapackage offers some convenient ways to assess overfitting through a simple process known as data splitting. I’ll start off by showing the original code (with only some slight modifications) from the article, and then I’ll walk you through assessing overfitting and performance on a new model. Finally, we’ll compare yearly δ13C CO2 predictions from the new and old models. The conclusion (spoiler!) is that the old model is still very robust and compares nearly perfectly to one where overfitting was explicitly evaluated (and isn’t a problem whatsoever).
Here, we randomly split the data into a training set (80% of the data) and a testing set (20% of the data). We'll build the model using the training set and evaluate its performance; then we'll use the testing set to determine how well the model predicts data unused in its construction.
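As a minimal sketch, the split might look like this with rsample (loaded via tidymodels); the name of the cleaned data frame, suess, and the seed are assumptions for illustration:

library(tidymodels)

set.seed(1234)  # arbitrary seed, just makes the random split reproducible

# 80% of rows go to training, 20% are held out for testing
# (assumed data frame name: suess)
suess_split <- initial_split(suess, prop = 0.80)
suess_train <- training(suess_split)
suess_test  <- testing(suess_split)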
# Fit the LOESS model on the training set; surface = "direct" computes
# the fit exactly rather than by interpolation, allowing extrapolation
loess_model_new <- loess(
  d13C_CO2 ~ age_AD,
  data = suess_train,
  span = 0.10,
  control = loess.control(surface = "direct")
)
Tidy the LOESS Object and Assess Performance Metrics
The metrics below are root mean square error (rmse), R-squared (rsq), and mean absolute error (mae). RMSE can be thought of as the standard deviation of prediction error, R-squared gauges the strength of the correlation between observed and predicted values, and MAE is the average magnitude of prediction error. We can see below that RMSE and MAE are very low, while R-squared is extremely high. The model seems highly effective at predicting yearly δ13C CO2.
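For reference, here's a sketch of how these numbers could be produced; the definition of suess_metric as a yardstick metric set is an assumption, inferred from how it's used on the testing set later on:

# Bundle the three performance metrics into a single yardstick function
# (assumed definition, inferred from its use below)
suess_metric <- metric_set(rmse, rsq, mae)

# broom's augment() tidies the LOESS object, attaching fitted values
# (.fitted) to the training data
train_res <- augment(loess_model_new)

suess_metric(train_res, truth = d13C_CO2, estimate = .fitted)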
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      0.0419
2 rsq     standard      0.995
3 mae     standard      0.0315
The observed (x-axis) versus predicted (y-axis) plot below illustrates initial model effectiveness well. The dashed line is the 1:1 line, which represents a perfect relationship between observed and predicted values.
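Something along these lines would reproduce a plot like that; coord_obs_pred() comes from the tune package (attached with tidymodels) and forces both axes onto the same scale:

# Observed vs. predicted yearly δ13C CO2 for the training data,
# assuming train_res from the sketch above
ggplot(train_res, aes(x = d13C_CO2, y = .fitted)) +
  geom_abline(linetype = "dashed") +  # 1:1 line: a perfect prediction
  geom_point(alpha = 0.5) +
  coord_obs_pred() +  # equal axis scales, from tune
  labs(x = "Observed δ13C CO2 (‰)", y = "Predicted δ13C CO2 (‰)")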
Now we can assess how well the model handles new data. The performance metrics (rmse, rsq, and mae) indicate the model handles new data shockingly well. Again, RMSE and MAE are low and R-squared is extremely high.
# Predict δ13C CO2 for the held-out testing set (with standard errors),
# attach the predictions to the testing data, and score them
test_res <- data.frame(predict(loess_model_new, newdata = suess_test, se = TRUE))
test_res <- bind_cols(suess_test, test_res)

suess_metric(test_res, truth = d13C_CO2, estimate = fit)
# A tibble: 3 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 rmse    standard      0.0557
2 rsq     standard      0.993
3 mae     standard      0.0449